Non-shared disk cluster – a fault tolerant, commodity approach to hi-bandwidth data analysis
Abstract
The STAR experiment, in collaboration with the NERSC Scientific Data Management Group, is prototyping and developing an approach to high-bandwidth data analysis that uses commodity components in a fault-tolerant fashion. The prototype hardware consists of two small clusters of Linux nodes (about 10 dual-CPU nodes), each node with a few hundred GB of local disk. One cluster is at the RCF at Brookhaven and the other at PDSF at LBNL/NERSC. The local disk on each node is not exported on the network, so all processing of data occurs on processors with locally attached disk. A file catalog tracks and manages the placement of data on the local disks and is also used to coordinate processing with the nodes that hold the requested data. This paper describes the current status of the project and the development plans for a full-scale implementation consisting of tens of TB of disk capacity and more than 100 processors. Initial ideas for a full implementation include modifications to STACS (http://gizmo.lbl.gov/stacs), the HENP Grand Challenge software that combines queries of the central tag database to define analysis tasks, and a new component, a parallel job dispatcher, that splits analysis tasks into multiple jobs submitted to nodes containing the requested data or to nodes with space to stage some of the requested data. The system is fault tolerant to the extent that individual data nodes may fail without disturbing the processing on other nodes, and a failed node can be restored. Jobs interrupted on the failed node are restarted on other nodes that have the necessary data.
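The dispatching idea in the abstract can be illustrated with a minimal sketch. It is not the STAR or STACS implementation: the function names `dispatch` and `reschedule`, the dictionary-based catalog, and the placement rules (least-loaded holder, most free space for staging) are all assumptions made for illustration only.

```python
def dispatch(files, catalog, free_space, file_size):
    """Split an analysis task (a list of files) into per-node jobs.

    catalog:    file -> list of nodes holding a local copy (hypothetical layout)
    free_space: node -> free bytes on its local disk
    file_size:  file -> size in bytes
    Returns (jobs, staged, unplaced).
    """
    jobs = {}                 # node -> files that node will process
    staged = []               # (file, node) pairs that must be staged first
    unplaced = []             # files that fit on no node right now
    space = dict(free_space)  # work on a copy; do not mutate the caller's dict
    for f in files:
        holders = catalog.get(f, [])
        if holders:
            # Prefer a node that already has the file; break ties by load.
            node = min(holders, key=lambda n: (len(jobs.get(n, [])), n))
        else:
            # No local copy anywhere: stage onto the node with most free space.
            candidates = [n for n in sorted(space) if space[n] >= file_size[f]]
            if not candidates:
                unplaced.append(f)
                continue
            node = max(candidates, key=lambda n: space[n])
            space[node] -= file_size[f]
            staged.append((f, node))
        jobs.setdefault(node, []).append(f)
    return jobs, staged, unplaced


def reschedule(failed_node, jobs, catalog, free_space, file_size):
    """Re-dispatch the files of a failed node using only surviving nodes."""
    lost = jobs.get(failed_node, [])
    live_catalog = {f: [n for n in nodes if n != failed_node]
                    for f, nodes in catalog.items()}
    live_space = {n: s for n, s in free_space.items() if n != failed_node}
    return dispatch(lost, live_catalog, live_space, file_size)


# Hypothetical example data: two nodes, three files.
catalog = {"a": ["n1"], "b": ["n1", "n2"], "c": []}
free = {"n1": 10, "n2": 50}
size = {"a": 5, "b": 5, "c": 20}
jobs, staged, unplaced = dispatch(["a", "b", "c"], catalog, free, size)
```

In this sketch a failure of one node only affects the files assigned to it; `reschedule` hands those files to surviving nodes that hold a copy, and falls back to staging when no copy exists and space allows.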
Similar resources
Providing Single I/O Space and Multiple Fault Tolerance in a Distributed RAID
Commodity EIDE disks provide low cost storage but are severely limited in bandwidth and cannot be made fault-tolerant. On the other hand, conventional RAID devices provide reliability and performance but worse price/performance figures. A cluster of PCs can be seen as a collection of networked low cost disks; such a collection can be operated by proper software so as to provide the abstraction ...
Synchronizing Shared Memory in the SEQUOIA Fault-Tolerant Multiprocessor
There are three dominant themes in building high transaction rate multiprocessor systems, namely shared memory (e.g. Synapse, IBM/AP configurations), shared disk (e.g. VAX/cluster, any multi-ported disk system), and shared nothing (e.g. Tandem, Tolerant). This paper argues that shared nothing is the preferred approach.
Data Replication Strategies for Fault Tolerance and Availability on Commodity Clusters
Recent work has shown the advantages of using persistent memory for transaction processing. In particular, the Vista transaction system uses recoverable memory to avoid disk I/O, thus improving performance by several orders of magnitude. In such a system, however, the data is safe when a node fails, but unavailable until it recovers, because the data is kept in only one memory. In contrast, our...
Using a Gigabit Ethernet Cluster as a Distributed Disk Array with Multiple Fault Tolerance
A cluster of PCs can be seen as a collection of networked low cost disks; such a collection can be operated by proper software so as to provide the abstraction of a single, larger block device. By adding suitable data redundancy, such a disk collection as a whole could act as single, highly fault tolerant, distributed RAID device, providing capacity and reliability along with the convenient pri...
Recovery of Memory and Process in DSM
Keywords: multiprocessor, shared memory, high availability. In this report, we discuss the recovery of memory and processes on the platform of a shared-memory DSM system. We divide the problem into recovery of unaffected memory (RUM), and recovery of affected processes (RAP). We point out that specially designed fault-tolerant, non-volatile memory is neither sufficient nor necessary to solve the problem of...
Publication date: 2001